1 Overview

1.1 RStudio Workspace

1.2 Contents of a Rmarkdown {.Rmd}

1.3 Knitting a document

  • Knitting
  • Preview in Viewer Pane

2 Setup

2.1 Loading Packages

  1. ONLY ONCE – Install Package
  • Type: install.packages("put_package_name_here") into the console
  • e.g. install.packages("tidyverse")
  1. EVERY TIME YOU WANT TO USE – Load Package
  • Type: library("put_package_name_here") into your code chunk
  • e.g. library("tidyverse")
# Load Relevant Packages
library(tidyverse) # piping `%>%`, plotting, reading data
library(skimr) # exploratory data summary
library(naniar) # exploratory plots
library(kableExtra) # tables
library(lubridate) # for date variables
library(plotly)

# Extension
  # Try to save time by installing a package to install packages!
  # install.packages("pacman")
  # pacman::p_load(tidyverse, skimr, naniar, kableExtra, lubridate, plotly)

2.2 Loading Data

If the data is online:

This is easy for datasets that are not too large & already hosted online!

If the data is a local file:

  1. File Management
  • In general, put your data file in the same folder as your R file.
  1. Setting working directory
  • If you are knitting your document: it will automatically look for data in the same folder as your R file. So you should have no dramas (per step 1.).

  • If you are running a section of code: you will need to specify which folder the data is in. The best way to do this is by following these menu options: Session Menu >> Set Working Directory >> To Source File Location.

  1. Load Data
  • Read data using the read_csv() function which comes from the tidyverse package.
# Load Data
data = read_csv("https://www.maths.usyd.edu.au/u/UG/OL/OLEO5605/r/NYC_Drinking_Water.csv")
#data = read_csv("Drinking_Water_Quality_Distribution_Monitoring_Data.csv")

3 Exploratory Data Analysis

3.1 Quick Snapshot

# Glimpse Function [From tidyverse package]
data %>% glimpse()
## Rows: 72,709
## Columns: 10
## $ `Sample Date`                         <chr> "01/01/2015", "01/01/2015", "01/…
## $ `Sample Time`                         <time> 12:19:00, 11:15:00, 10:09:00, 1…
## $ `Sample Site`                         <chr> "1S07", "1S04", "1S03A", "1S03B"…
## $ `Sample class`                        <chr> "Operational", "Operational", "O…
## $ Location                              <chr> "SS - Shaft 7 of City Tunnel No.…
## $ `Residual Free Chlorine (mg/L)`       <dbl> 0.58, 0.71, 0.79, 0.77, 0.74, 0.…
## $ `Turbidity (NTU)`                     <dbl> 0.96, 0.94, 0.93, 0.93, 0.95, 1.…
## $ `Fluoride (mg/L)`                     <dbl> 0.79, 0.80, 0.79, 0.80, NA, NA, …
## $ `Coliform (Quanti-Tray) (MPN /100mL)` <chr> "<1", "<1", "<1", "<1", "<1", "<…
## $ `E.coli(Quanti-Tray) (MPN/100mL)`     <chr> "<1", "<1", "<1", "<1", "<1", "<…
# Skim Function [From skimr package]
data %>% skim()
Data summary
Name Piped data
Number of rows 72709
Number of columns 10
_______________________
Column type frequency:
character 6
difftime 1
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Sample Date 0 1 10 10 0 1673 0
Sample Site 0 1 4 5 0 398 0
Sample class 0 1 9 20 0 7 0
Location 0 1 29 152 0 1263 0
Coliform (Quanti-Tray) (MPN /100mL) 0 1 1 6 0 47 0
E.coli(Quanti-Tray) (MPN/100mL) 0 1 1 2 0 3 0

Variable type: difftime

skim_variable n_missing complete_rate min max median n_unique
Sample Time 2 1 17520 secs 57780 secs 10:10:00 463

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Residual Free Chlorine (mg/L) 3 1.00 0.58 0.21 -9.99 0.45 0.59 0.72 2.20 ▁▁▁▁▇
Turbidity (NTU) 506 0.99 0.74 0.28 0.07 0.62 0.73 0.86 33.80 ▇▁▁▁▁
Fluoride (mg/L) 63336 0.13 0.71 0.06 0.03 0.69 0.71 0.73 0.89 ▁▁▁▇▇
# Summary Function [From base package -- preinstalled!]
data %>% summary()
##  Sample Date        Sample Time       Sample Site        Sample class      
##  Length:72709       Length:72709      Length:72709       Length:72709      
##  Class :character   Class1:hms        Class :character   Class :character  
##  Mode  :character   Class2:difftime   Mode  :character   Mode  :character  
##                     Mode  :numeric                                         
##                                                                            
##                                                                            
##                                                                            
##    Location         Residual Free Chlorine (mg/L) Turbidity (NTU)  
##  Length:72709       Min.   :-9.9900               Min.   : 0.0700  
##  Class :character   1st Qu.: 0.4500               1st Qu.: 0.6200  
##  Mode  :character   Median : 0.5900               Median : 0.7300  
##                     Mean   : 0.5845               Mean   : 0.7385  
##                     3rd Qu.: 0.7200               3rd Qu.: 0.8600  
##                     Max.   : 2.2000               Max.   :33.8000  
##                     NA's   :3                     NA's   :506      
##  Fluoride (mg/L) Coliform (Quanti-Tray) (MPN /100mL)
##  Min.   :0.03    Length:72709                       
##  1st Qu.:0.69    Class :character                   
##  Median :0.71    Mode  :character                   
##  Mean   :0.71                                       
##  3rd Qu.:0.73                                       
##  Max.   :0.89                                       
##  NA's   :63336                                      
##  E.coli(Quanti-Tray) (MPN/100mL)
##  Length:72709                   
##  Class :character               
##  Mode  :character               
##                                 
##                                 
##                                 
## 

3.2 Exploring missingness

# vis_miss function [From visdat or naniar packages]
vis_miss(data, warn_large_data = FALSE)

3.3 Exploring numeric variables

  • What do you notice about outliers & skewness?
data %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything(),
               names_to = "Variable",
               values_to = "Value") %>%
  group_by(Variable) %>%
  summarise(Mean = mean(Value, na.rm = TRUE),
            Median = median(Value, na.rm = TRUE)) %>%
  kable() %>% # putting into a table
  kable_styling(bootstrap_options = c("hover")) # making table look good
Variable Mean Median
Fluoride (mg/L) 0.7144069 0.71
Residual Free Chlorine (mg/L) 0.5844564 0.59
Turbidity (NTU) 0.7385309 0.73
p = data %>%  
  select(where(is.numeric)) %>%
  pivot_longer(everything(), 
               names_to = "Variable", 
               values_to = "Value") %>% 
  filter(!is.na(Value)) %>%
  ggplot(aes(x = Value)) +
  facet_wrap(~ Variable, scales = "free_y") +
  geom_histogram(bins = 30, fill = "lightblue", color = "darkblue") +
  labs(y = "Frequency") +
  theme_bw()

ggplotly(p)
p = data %>%  
  select(where(is.numeric)) %>%
  pivot_longer(everything(), 
               names_to = "Variable", 
               values_to = "Value") %>% 
  filter(!is.na(Value)) %>%
  ggplot(aes(y = Value)) +
  facet_wrap(~ Variable, scales = "free_y") +
  geom_boxplot(fill = "lightblue", color = "darkblue") +
  theme_bw() +
  theme(strip.text.x = element_text(size = 8))

p

4 Data Cleaning

# See what state things are currently in
data %>% glimpse()
## Rows: 72,709
## Columns: 10
## $ `Sample Date`                         <chr> "01/01/2015", "01/01/2015", "01/…
## $ `Sample Time`                         <time> 12:19:00, 11:15:00, 10:09:00, 1…
## $ `Sample Site`                         <chr> "1S07", "1S04", "1S03A", "1S03B"…
## $ `Sample class`                        <chr> "Operational", "Operational", "O…
## $ Location                              <chr> "SS - Shaft 7 of City Tunnel No.…
## $ `Residual Free Chlorine (mg/L)`       <dbl> 0.58, 0.71, 0.79, 0.77, 0.74, 0.…
## $ `Turbidity (NTU)`                     <dbl> 0.96, 0.94, 0.93, 0.93, 0.95, 1.…
## $ `Fluoride (mg/L)`                     <dbl> 0.79, 0.80, 0.79, 0.80, NA, NA, …
## $ `Coliform (Quanti-Tray) (MPN /100mL)` <chr> "<1", "<1", "<1", "<1", "<1", "<…
## $ `E.coli(Quanti-Tray) (MPN/100mL)`     <chr> "<1", "<1", "<1", "<1", "<1", "<…
# Data Cleaning

## Changing type of `Sample Date`
  # Note: mdy from [From tidyverse package]
data = data %>% 
  mutate(`Sample Date` = mdy(`Sample Date`))

# Adding additional time columns
data = data %>% 
        mutate(`Week of Year` = week(`Sample Date`),  
               `Weekday` = wday(`Sample Date`), 
               `Month Number` = month(`Sample Date`),
               `Hour` = hour(`Sample Time`))

# Giving Month Name an Order
data = data %>% 
  mutate(`Month` = factor(month.name[`Month Number`], 
                          levels = month.name))

# Converting Categorical Variables to Factors
data = data %>% 
  mutate(across(where(is_character) & !c(Location, `Sample Site`), 
         as_factor))

# Drop NA -- Q: Do you think this is appropriate?
# data = data %>%
#   drop_na()
# Check we're happy with cleaned data
data %>% glimpse()
## Rows: 72,709
## Columns: 15
## $ `Sample Date`                         <date> 2015-01-01, 2015-01-01, 2015-01…
## $ `Sample Time`                         <time> 12:19:00, 11:15:00, 10:09:00, 1…
## $ `Sample Site`                         <chr> "1S07", "1S04", "1S03A", "1S03B"…
## $ `Sample class`                        <fct> Operational, Operational, Operat…
## $ Location                              <chr> "SS - Shaft 7 of City Tunnel No.…
## $ `Residual Free Chlorine (mg/L)`       <dbl> 0.58, 0.71, 0.79, 0.77, 0.74, 0.…
## $ `Turbidity (NTU)`                     <dbl> 0.96, 0.94, 0.93, 0.93, 0.95, 1.…
## $ `Fluoride (mg/L)`                     <dbl> 0.79, 0.80, 0.79, 0.80, NA, NA, …
## $ `Coliform (Quanti-Tray) (MPN /100mL)` <fct> <1, <1, <1, <1, <1, <1, <1, <1, …
## $ `E.coli(Quanti-Tray) (MPN/100mL)`     <fct> <1, <1, <1, <1, <1, <1, <1, <1, …
## $ `Week of Year`                        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Weekday                               <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ `Month Number`                        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Hour                                  <int> 12, 11, 10, 10, 9, 8, 9, 11, 7, …
## $ Month                                 <fct> January, January, January, Janua…
data %>% summary()
##   Sample Date         Sample Time       Sample Site       
##  Min.   :2015-01-01   Length:72709      Length:72709      
##  1st Qu.:2016-04-03   Class1:hms        Class :character  
##  Median :2017-06-26   Class2:difftime   Mode  :character  
##  Mean   :2017-06-14   Mode  :numeric                      
##  3rd Qu.:2018-09-11                                       
##  Max.   :2019-10-31                                       
##                                                           
##                Sample class     Location         Residual Free Chlorine (mg/L)
##  Operational         :27733   Length:72709       Min.   :-9.9900              
##  Compliance          :44289   Class :character   1st Qu.: 0.4500              
##  Resample_Compliance :  472   Mode  :character   Median : 0.5900              
##  Resample_Operational:    1                      Mean   : 0.5845              
##  Entry Point         :  168                      3rd Qu.: 0.7200              
##  Re-Sample           :   33                      Max.   : 2.2000              
##  Op-resample         :   13                      NA's   :3                    
##  Turbidity (NTU)   Fluoride (mg/L) Coliform (Quanti-Tray) (MPN /100mL)
##  Min.   : 0.0700   Min.   :0.03    <1     :72436                      
##  1st Qu.: 0.6200   1st Qu.:0.69    1      :  102                      
##  Median : 0.7300   Median :0.71    >200.5 :   33                      
##  Mean   : 0.7385   Mean   :0.71    2      :   28                      
##  3rd Qu.: 0.8600   3rd Qu.:0.73    3.1    :   20                      
##  Max.   :33.8000   Max.   :0.89    4.2    :    9                      
##  NA's   :506       NA's   :63336   (Other):   81                      
##  E.coli(Quanti-Tray) (MPN/100mL)  Week of Year      Weekday     
##  <1:72704                        Min.   : 1.00   Min.   :1.000  
##  - :    1                        1st Qu.:14.00   1st Qu.:2.000  
##  1 :    4                        Median :26.00   Median :4.000  
##                                  Mean   :26.15   Mean   :4.019  
##                                  3rd Qu.:38.00   3rd Qu.:6.000  
##                                  Max.   :53.00   Max.   :7.000  
##                                                                 
##   Month Number         Hour             Month      
##  Min.   : 1.000   Min.   : 4.00   August   : 6972  
##  1st Qu.: 4.000   1st Qu.: 9.00   July     : 6851  
##  Median : 6.000   Median :10.00   May      : 6796  
##  Mean   : 6.422   Mean   : 9.68   June     : 6630  
##  3rd Qu.: 9.000   3rd Qu.:11.00   September: 6578  
##  Max.   :12.000   Max.   :16.00   January  : 6521  
##                   NA's   :2       (Other)  :32361

5 Interrogating the data

5.1 Which date had the highest Turbidity reading?

top_10_turbidity = data %>% 
  arrange(desc(`Turbidity (NTU)`)) %>% 
  select(`Sample Date`, `Turbidity (NTU)`) %>%
  head(10)

top_10_turbidity %>%
  kable(caption = "Top 10 Turbidity readings") %>%
  kable_styling(bootstrap_options = c("hover"))
Top 10 Turbidity readings
Sample Date Turbidity (NTU)
2018-01-16 33.80
2017-05-19 27.10
2019-05-17 14.10
2017-12-15 6.97
2018-01-23 6.97
2017-11-27 6.68
2017-12-13 6.29
2016-05-16 6.04
2017-12-29 5.96
2018-02-21 5.89

The highest Turbidity rating was on 2018-01-16 with a reading of 33.8.

5.2 Is there a difference between the median readings for Turbidity, Chlorine, and Fluoride for the different types of sample sites?

class_medians = data %>% 
  group_by(`Sample class`) %>%
  summarise(med_chlorine = median(`Residual Free Chlorine (mg/L)`, na.rm = TRUE),
            med_turbidity = median(`Turbidity (NTU)`, na.rm = TRUE),
            med_flouride = median(`Fluoride (mg/L)`, na.rm = TRUE)) 

class_medians %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover"))
Sample class med_chlorine med_turbidity med_flouride
Operational 0.69 0.75 0.71
Compliance 0.52 0.72 0.71
Resample_Compliance 0.47 0.70 NA
Resample_Operational 0.75 0.98 NA
Entry Point 0.63 0.72 0.72
Re-Sample 0.39 0.67 NA
Op-resample 0.73 0.70 0.72

5.3 Create a boxplot to visualise the difference between Entry Point and Operational levels of Residual Free Chlorine.

p = data %>% 
  filter(`Sample class` == "Entry Point" | 
         `Sample class` == "Operational") %>%
  ggplot(aes(x = `Sample class`, 
             y = `Residual Free Chlorine (mg/L)`)) +
  geom_boxplot(fill = "lightblue", color = "darkblue") +
  labs(title = "Residual Free Chlorine (mg/L) for different sample classes") +
  theme_bw()

ggplotly(p)

5.4 Which sample sites have the highest and lowest median readings for each chemical?

site_medians_wide = data %>% 
  group_by(`Sample Site`) %>%
  summarise(med_chlorine = median(`Residual Free Chlorine (mg/L)`, na.rm = TRUE),
            med_turbidity = median(`Turbidity (NTU)`, na.rm = TRUE),
            med_fluoride = median(`Fluoride (mg/L)`, na.rm = TRUE)) 

# Tidy Way
site_medians_long = site_medians_wide %>%
  pivot_longer(!c(`Sample Site`),
               names_to = "Median Type",
               values_to = "Median Value")

max_min_median_sites = site_medians_long %>%
  group_by(`Median Type`) %>%
    summarise(
    Min_Val = min(`Median Value`, na.rm = TRUE),
    Min_Site = paste(`Sample Site`[which(`Median Value` == Min_Val)], 
                     collapse = ", "),
    Max_Val = max(`Median Value`, na.rm = TRUE),
    Max_Site = paste(`Sample Site`[which(`Median Value` == Max_Val)], 
                     collapse = ", ")
  )

max_min_median_sites %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover"))
Median Type Min_Val Min_Site Max_Val Max_Site
med_chlorine 0.11 77750 0.91 1S03A
med_fluoride 0.69 32350 0.81 1S04
med_turbidity 0.45 3SC26 1.03 51900
# Non-Tidy Way -- copy code for each Median Type
# site_medians_wide %>%
#   select(`Sample Site`, med_turbidity) %>%
#   arrange(desc(med_turbidity)) %>%
#   filter(row_number() %in% c(1, n()))
# 
# site_medians_wide %>%
#   select(`Sample Site`, med_fluoride) %>%
#   arrange(desc(med_fluoride)) %>%
#   filter(row_number() %in% c(1, n()))
# 
# site_medians_wide %>%
#   select(`Sample Site`, med_chlorine) %>%
#   arrange(desc(med_chlorine)) %>%
#   filter(row_number() %in% c(1, n()))

5.5 Visualise the difference in readings between the top and bottom sites for Turbidity in different ways. Can you find anything interesting about the sites?

site_names = max_min_median_sites %>%
  filter(`Median Type` == "med_turbidity") %>%
  select(`Min_Site`, `Max_Site`) %>%
  t()

p = data %>% 
  filter(`Sample Site` %in% site_names) %>% 
  ggplot(aes(x = `Turbidity (NTU)`, fill = `Sample Site`)) +
    geom_histogram(bins = 30, color = "white") +
    scale_fill_manual(values = c("lightblue", "darkblue")) +
    theme_bw() +
    labs(y = "Frequency")

ggplotly(p)
p = data %>% 
  filter(`Sample Site` %in% site_names) %>% 
  ggplot(aes(x = `Sample Date`, y = `Turbidity (NTU)`, color = `Sample Site`))+
    geom_line() +
    scale_color_manual(values = c("lightblue", "darkblue")) +
    theme_bw()

ggplotly(p)

5.6 How have the median readings for each of the chemicals changed over time?

p = data %>% 
  
  group_by(`Sample Date`) %>%
  
  summarise(med_chlorine = median(`Residual Free Chlorine (mg/L)`, na.rm = TRUE),
            med_turbidity = median(`Turbidity (NTU)`, na.rm = TRUE),
            med_fluoride = median(`Fluoride (mg/L)`, na.rm = TRUE)) %>%
  
  pivot_longer(!c(`Sample Date`),
               names_to = "Median Type",
               values_to = "Median Value") %>%
  
  ggplot(aes(x = `Sample Date`, y = `Median Value`, colour = `Median Type`)) + 
    geom_line() +
    scale_color_manual(values = c("#ab5f54", "lightblue", "darkblue")) +
    theme_bw()

ggplotly(p)

6 Independent research question: Do all water quality sites adhere to national health regulations?

The United States Environmental Protection Agency (EPA) publishes regulations for what makes water safe to drink. Here are the regulations for some of the features contained in the NYC drinking water quality dataset:

My research question is whether all sample sites and sample types adhere to these requirements for safe drinking water. For the sake of data completeness, I will focus only on the observations which have non-missing data for all variables:

data_complete <- data %>%
  tidyr::drop_na()

Since three of these five metrics require month-level aggregation for analysis, I will analyse the data at the level of each month. First, I will rename column names to have periods instead of spaces using make.names:

data_complete <- data_complete %>%
  # Replace spaces with periods in column names
  rename_all(make.names)

It was also observed that one data point has “-” as the value for Coliform and E.Coli, indicating this data point is missing. Coincidentally, this observation is also missing fluoride levels. This observation will be dropped for the sake of data completeness:

data_agg <- data_complete %>%
  # Extract year from date
  mutate(Year = year(Sample.Date)) %>%
  # Create binary present/absent variables for Coliform and E.Coli
  mutate(Coliform_Positive = Coliform..Quanti.Tray...MPN..100mL. != "<1",
         E_Coli_Positive = E.coli.Quanti.Tray...MPN.100mL. != "<1") %>%
  group_by(Month.Number, Year, Sample.Site, Sample.class) %>%
  summarise(num_obs = n(),
            Chlorine_Good = all(Residual.Free.Chlorine..mg.L. < 4),
            Fluoride_Good = all(Fluoride..mg.L. < 4),
            Turbidity_Good = all(Turbidity..NTU. < 1) & (sum(Turbidity..NTU. <= 0.3) / num_obs) >= 0.95,
            Coliform_Good = sum(Coliform_Positive)/n() < 0.05,
            E_Coli_Good = sum(E_Coli_Positive)/n() < 0.05)

First, I will visualize the distribution of each of the five testing conditions across all observations by month:

library(cowplot)
theme_set(theme_cowplot())
data_agg %>%
  pivot_longer(cols = c(Chlorine_Good:E_Coli_Good),
               names_to = "Condition",
               values_to = "Result") %>%
  mutate(Condition = str_replace_all(Condition, "_Good", "")) %>%
  ggplot(data=., mapping=aes(x=Condition)) +
  geom_bar(aes(fill=Result)) +
  coord_flip() +
  xlab("Condition Tested") +
  ylab("Observation by Month")

As the stacked bar plot shows, most observations adhere to guidelines for fluoride, E.coli, coliform, and chlorine; however, only a minority of monthly observations adhere to turbidity guidelines.

This minority of samples meeting all requirements can be examined further:

library(DT)
data_agg %>%
  filter(Chlorine_Good, 
         Fluoride_Good,
         Turbidity_Good,
         Coliform_Good,
         E_Coli_Good) %>%
  arrange(Sample.Site, Year, Month.Number) %>%
  DT::datatable()

There are a total of 79 observations by month that fulfill all of the requirements for chlorine, fluoride, turbidity, coliform, and E.coli. We can break this down by site to see if these are all different site (i.e. 79 different sites/observation types), or if some sites consistently meet these requirements across different months:

data_agg %>%
  group_by(Sample.Site, Sample.class) %>%
  mutate(Number_of_Time_Points = n()) %>%
  filter(Chlorine_Good, 
         Fluoride_Good,
         Turbidity_Good,
         Coliform_Good,
         E_Coli_Good) %>%
  dplyr::summarise(Number_of_Compliant_Time_Points = n(),
                   Percent_Compliant_Time_Points = 100*round(n()/Number_of_Time_Points, 4)) %>%
  distinct() %>%
  arrange(desc(Number_of_Compliant_Time_Points)) %>%
  kable(.) %>%
  kable_styling(full_width = F)
Sample.Site Sample.class Number_of_Compliant_Time_Points Percent_Compliant_Time_Points
3SC26 Operational 20 45.45
1SCL1 Operational 11 22.00
33450 Compliance 11 57.89
33950 Compliance 10 52.63
11750 Compliance 8 44.44
31550 Compliance 4 22.22
18650 Compliance 3 15.79
30150 Compliance 3 16.67
32350 Compliance 3 15.79
1SCH3 Operational 2 5.00
1SCH3 Entry Point 1 100.00
1SCL1 Entry Point 1 100.00
1SCL1 Op-resample 1 50.00
31750 Compliance 1 5.26

As the table shows, there are a total of 14 sample site and sample class combinations that met the EPA requirements for safe drinking water at one or more time points. The top two entries are both operational samples from two different sites (3SC26 and 1SCL1). The latter of the two (site 1SCL1) also met all requirements for entry point and op-resample classes as well.

7 Executive summary

7.1 Live lab worksheet component

In this document, NYC drinking water quality data was explored using data wrangling and visualization techniques. Data completeness was assessed by tabulating NA values with summary functions and using naniar::vis_miss(), which highlighted that data in the Fluoride (mg/L) column was missing in 87.11% (63,336) of observations. Given this fact, it would likely not be appropriate to unilaterally drop any observations with NA values using tidyr::drop_na(), as that would wipe out 87.11% of the data right off the bat based on Fluoride (mg/L) alone. Since the majority of data in this column is missing, imputation is likely not a tractable option either; for downstream analysis, if this feature is necessary to include and ~9,000 observations are sufficient, then tidyr::drop_na() could be used. However, if fluoride levels are not necessary to include, then this feature can be dropped; the only other feature with missing data is Turbidity (NTU), in which 506 observations (0.7%) are missing. In this case, imputation techniques could be used to fill in those missing data and render the dataset complete.

As the boxplots indicate, there were 100+ outliers in each of the evaluated variables (fluoride content, residual free chlorine content, and turbidity), with each feature exhibiting visibly skewed distribution. Fluoride was negative skewed, as evidenced by more outliers with values lower than the interquartile range (IQR). While residual free chlorine had one outlier with a large-magnitude negative observation, it is likely that this value is due to measurement error as it is impossible to have a negative quantity of chlorine; barring that one questionable observation, this variable exhibited a positive skew, given most outliers were greater than the IQR. For turbidity, it is difficult to be certain whether there are truly more outliers above the IQR or if there is overplotting for values below the IQR; at face value, it appears that turbidity also has a positive skew.

For data cleaning, several new features were generated based on the Sample Date column to denote hour of the day, day of the week, week number out of the calendar year, and month number out of the calendar year. Categorical variables were additionally encoded as factors. The data were then aggregated to summarise the dates with the top 10 turbidity ratings; to compare median readings for fluoride, chloride, and turbidity across sample sites; to compare the distribution of chlorine levels across sample sites; to calculate which sample sites have the highest versus lowest levels of fluoride, chloride, and turbidity; and to visualize temporal trends in the median chloride, fluoride, and turbidity across all sample sites.

7.2 Independent research question component

In the research question component, it was investigated whether sample sites and observation types met United States EPA requirements for safe drinking water. EPA guidelines for chlorine, fluoride, turbidity, Coliform, and E.coli were independently sourced and are described above. In order to directly compare sample types and classes for these features, only complete observations were included (i.e. any rows with NA values were dropped). Data were aggregated at the month level to work with month-based requirements (e.g. <5% of samples with detected E.coli per month) and the number of compliant sample types and samples were visualized as a stacked bar chart using ggplot2::ggplot(). The month-aggregated data were then filtered to only those observations for which all conditions were met, with the results displayed using an interactive HTML table with DT::DT(). These data were further aggregated to summarise the number of compliant months per observation site and sample type, with the results displayed using a static HTML table with knitr::kable().